A Feature-Rich CRF Segmenter for Chinese Micro-Blog

نویسندگان

  • Yabin Leng
  • Weiwei Liu
  • Sheng Wang
  • Xiaojie Wang
چکیده

This paper describes our system for Chinese word segmentation of micro-blog text, one of the NLPCC-ICCPOL 2016 Shared Tasks [1]. The CRF (Conditional Random Field) model is employed to model word segmentation as a sequence labeling problem, 7 sets of features are selected to train the CRF model. The system achieves fb 0.798144 on closed track, 0.81968 on semi-open track, and 0.82217 on open track with weighted measures [2].

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Adapting Conventional Chinese Word Segmenter for Segmenting Micro-blog Text: Combining Rule-based and Statistic-based Approaches

We describe two adaptation strategies which are used in our word segmentation system in participating the Microblog word segmentation bake-off: Domain invariant information is extracted from the in-domain unlabelled corpus, and is incorporated as supplementary features to conventional word segmenter based on Conditional Random Field (CRF), we call it statistic-based adaptation. Some heuristic r...

متن کامل

CRFs-Based Chinese Word Segmentation for Micro-Blog with Small-Scale Data

In this paper, we proposed a Chinese word segmentation model for micro-blog text. Although Conditional Random Fields (CRFs) models have been presented to deal with word segmentation, this is still the first time to apply it for the segmentation in the domain of Chinese micro-blog. Different from the genres of common articles, micro-blog has gradually become a new literary with the development o...

متن کامل

The Character-based CRF Segmenter of MSRA&NEU for the 4th Bakeoff

This paper describes the Chinese Word Segmenter for the fourth International Chinese Language Processing Bakeoff. Base on Conditional Random Field (CRF) model, a basic segmenter is designed as a problem of character-based tagging. To further improve the performance of our segmenter, we employ a word-based approach to increase the in-vocabulary (IV) word recall and a post-processing to increase ...

متن کامل

Improving Chinese Word Segmentation on Micro-blog Using Rich Punctuations

Micro-blog is a new kind of medium which is short and informal. While no segmented corpus of micro-blogs is available to train Chinese word segmentation model, existing Chinese word segmentation tools cannot perform equally well as in ordinary news texts. In this paper we present an effective yet simple approach to Chinese word segmentation of micro-blog. In our approach, we incorporate punctua...

متن کامل

Micro blogs Oriented Word Segmentation System

We present a Chinese word segmentation system submitted to the first task on CLP 2012 back-offs. Our segmenter is built using a conditional random field sequence model. We set the combination of a few annotated micro blogs and People Daily corpus as the training data. We encode special words detected by rules and information extracted from unlabeled data into features. These features are used t...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2016